library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
library(readxl)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(class)
library(stringr)
library(matlab)
##
## Attaching package: 'matlab'
## The following object is masked from 'package:stats':
##
## reshape
## The following objects are masked from 'package:utils':
##
## find, fix
## The following object is masked from 'package:base':
##
## sum
cars <- read_csv('OLX_Car_Data_CSV.csv')
## Parsed with column specification:
## cols(
## Brand = col_character(),
## Condition = col_character(),
## Fuel = col_character(),
## `KMs Driven` = col_double(),
## Model = col_character(),
## Price = col_double(),
## `Registered City` = col_character(),
## `Transaction Type` = col_character(),
## Year = col_double()
## )
I decided to work with a dataset from Kaggle called “Pakistan Used Cars”: https://www.kaggle.com/karimali/used-cars-data-pakistan.
I ended up trying to predict the brand of a car from two numeric variables: 1) its price and 2) the number of kilometers it had been driven. I was surprisingly successful, although my knn models had some shortcomings… more on that later.
# clean out NA rows
cars.clean <- cars %>%
  na.omit()
# create my training set; used >50% of the data;
# selected only columns `KMs Driven` (4) and Price (6)
train <- cars.clean[, c(4, 6)] %>%
  head(12500)
# create my testing set
test <- cars.clean[-c(1:12500), c(4, 6)]
# create true classification labels;
# these are the corresponding correct values for the train set
pred_labels <- cars.clean[c(1:12500), 1]
# run knn to create my model! I used the square root of n for
# a k-value in this case as I found this was a good rule of thumb.
pred_model <- knn(train, test, pred_labels$Brand, k = 112)
# Next I selected the actual, correct Brand values for my test set;
# I compare these to the predictions my knn function made.
test.actuals <- cars.clean[-c(1:12500), 1]
# organize my data
prop.set <- data.frame(predictions = pred_model,
                       actuals = test.actuals$Brand)
# calculating the proportion of correct classifications
prop.correct <- mean(as.character(prop.set$predictions) == as.character(prop.set$actuals))
By using KNN with a k-value of 112, I found that I could predict the brand of the car correctly about 57% of the time. Pretty good!
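One caveat about the call above: `knn()` measures Euclidean distance on the raw columns, and Price and KMs Driven are on different scales, so whichever column has the larger spread tends to dominate the distance. The following is a minimal sketch, on made-up data rather than the OLX file, of standardizing both predictors with the training set's means and standard deviations before calling `knn()`. The data frame `toy`, its columns, and the choice of k = 5 are all hypothetical and only illustrate the idea.

```r
library(class)

set.seed(1)
# Hypothetical toy data standing in for the cleaned car data
n <- 200
toy <- data.frame(
  km    = c(rnorm(n / 2, 60000, 15000), rnorm(n / 2, 90000, 15000)),
  price = c(rnorm(n / 2, 1500000, 200000), rnorm(n / 2, 600000, 200000)),
  brand = rep(c("A", "B"), each = n / 2)
)

# Shuffle the rows, then split into training and test indices
idx       <- sample(n)
train_idx <- idx[1:150]
test_idx  <- idx[151:n]

# Standardize using the TRAINING means and standard deviations only
train_x <- scale(toy[train_idx, c("km", "price")])
test_x  <- scale(toy[test_idx, c("km", "price")],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Classify the test rows and compute the proportion correct
pred     <- knn(train_x, test_x, cl = toy$brand[train_idx], k = 5)
accuracy <- mean(as.character(pred) == toy$brand[test_idx])
```

Shuffling before the split also avoids depending on the row order of the file, which the `head(12500)` split above implicitly does.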
# Creating a sample of my data to make LOOCV more time-realistic
cars.clean.sample <- cars.clean[sample(nrow(cars.clean), 1000), ]
# initialize empty vectors to store generated values
LOOCV.prop <- NULL
pred <- NULL
for (k in 1:112) {
  for (i in 1:nrow(cars.clean.sample)) {
    # remove ith row and use for testing
    current.test <- cars.clean.sample[i, c(4, 6)]
    # use remaining data for training
    current.train <- cars.clean.sample[-i, c(4, 6)]
    current.train.labels <- cars.clean.sample[-i, 1]
    # do knn with above 2 data sets
    pred[i] <- as.character(knn(current.train, current.test, current.train.labels$Brand, k))
  }
  # Calculate the accuracy of knn for given k value
  LOOCV.prop[k] <- mean(as.character(pred) == as.character(cars.clean.sample$Brand))
  print(k)
}
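As an aside, the `class` package also provides `knn.cv()`, which performs leave-one-out cross-validation in a single call, so the inner loop over rows is not strictly necessary. Here is a rough sketch on simulated data; the matrix `x`, the factor `labels`, and the range of k are hypothetical stand-ins, not the car sample. Because `knn()` and `knn.cv()` break voting ties at random, results may differ slightly from the explicit loop.

```r
library(class)

set.seed(1)
# Hypothetical stand-in for the 1,000-row sample
n <- 100
x <- cbind(km = rnorm(n), price = rnorm(n))
labels <- factor(ifelse(x[, "price"] > 0, "A", "B"))

ks <- 1:15
loocv_acc <- sapply(ks, function(k) {
  # knn.cv() predicts each row from all of the other rows
  cv_pred <- knn.cv(train = x, cl = labels, k = k)
  mean(cv_pred == labels)
})
```

`loocv_acc` then plays the same role as `LOOCV.prop`: one leave-one-out accuracy per value of k.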
This histogram shows the distribution of LOOCV accuracies across the 112 values of k, so it tells me how much the choice of k matters rather than which k is best. The max tells me that in this sample about 57% of the classifications are accurate for the best k. Unfortunately I had trouble filtering the data, but by looking through LOOCV.prop I found that this value occurs when k = 24.
hist(LOOCV.prop)
max(LOOCV.prop)
## [1] 0.567
view(LOOCV.prop)
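Rather than scanning the vector by eye, the best k can be read off with `which.max()`, and plotting accuracy against k shows the relationship more directly than a histogram does. The sketch below uses a short made-up vector `acc` in place of `LOOCV.prop`; the values are purely illustrative. The index equals k only because k starts at 1.

```r
# Hypothetical accuracies for k = 1..6, standing in for LOOCV.prop
acc <- c(0.48, 0.51, 0.53, 0.55, 0.55, 0.52)

best_k   <- which.max(acc)  # index of the first maximum
best_acc <- acc[best_k]

# Accuracy as a function of k, with the best k marked
plot(seq_along(acc), acc, type = "b", xlab = "k", ylab = "LOOCV accuracy")
abline(v = best_k, lty = 2)
```

When several values of k tie for the maximum, `which.max()` returns the smallest of them.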
I decided to use 24 as a starting k-value for my predictions. I had to write a function that normalizes my numeric predictors and returns a prediction in the form of a Brand.
# Creating scaled versions of variables and then writing a function for testing
# mode function: returns the most frequent value in x
# (ties go to the value that appears first in x)
MaxTable <- function(x) {
  dd <- unique(x)
  dd[which.max(tabulate(match(x, dd)))]
}
# scaled variables (z-scores computed from the sample)
cars.clean.sample.new <- cars.clean.sample %>%
  mutate(km.scaled = (`KMs Driven` - mean(`KMs Driven`)) / sd(`KMs Driven`),
         price.scaled = (Price - mean(Price)) / sd(Price))
# function: scale the test point the same way, then take a majority vote
# among the k rows with the smallest squared Euclidean distance
cars.scaled <- function(test.km, test.price, k) {
  test.km.scaled <- (test.km - mean(cars.clean.sample$`KMs Driven`)) / sd(cars.clean.sample$`KMs Driven`)
  test.price.scaled <- (test.price - mean(cars.clean.sample$Price)) / sd(cars.clean.sample$Price)
  cars.clean.sample.new %>%
    mutate(distance = (km.scaled - test.km.scaled)^2 + (price.scaled - test.price.scaled)^2) %>%
    arrange(distance) %>%
    head(k) %>%
    summarize(knn.result = MaxTable(Brand)) %>%
    pull(knn.result)
}
Next I created grids of values in order to build visualizations that show how knn classifies the car data. I chose the grid limits so that most of the data, with the exception of outliers, would be visible somewhere on the grid.
km.grid <- seq(from = 1, to = 500000, by = 10000)
price.grid <- seq(from = 50000, to = 3000000, by = 50000)
grid <- expand.grid(km.grid, price.grid)
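To get a feel for the size of this grid, here is a small sketch that rebuilds the same two sequences and counts the points. The only change is that the grid columns are given names (`km` and `price`, chosen here for illustration) instead of the default `Var1` and `Var2`; the rest of this write-up keeps the defaults.

```r
# Same sequences as above, but with named grid columns (names are illustrative)
km.grid    <- seq(from = 1, to = 500000, by = 10000)
price.grid <- seq(from = 50000, to = 3000000, by = 50000)
named.grid <- expand.grid(km = km.grid, price = price.grid)

length(km.grid)     # 50 distinct kilometer values
length(price.grid)  # 60 distinct price values
nrow(named.grid)    # 50 * 60 = 3000 grid points
```

Each grid point needs one call to the prediction function, so the grid resolution directly sets how long the plots take to build.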
Finally I use my function and the grid values to create predictions for the brand of cars with different combinations of price and kilometers driven. By overlaying a plot of the actual data, I can see how the model created classifications that accurately classify a good portion of my data.
# each (Var1, Var2) pair is its own group, so cars.scaled() runs once per grid point
knn.k24 <- grid %>%
  group_by(Var1, Var2) %>%
  mutate(prediction = cars.scaled(Var1, Var2, 24))
# filtering data to make visualizations a bit clearer:
# there are so many brands that the graph can easily get cluttered;
# additionally, my knn model only predicted three classifications,
# namely Suzuki, Toyota, and Honda
cars.clean.sample.new.filtered <- cars.clean.sample.new %>%
  filter(Price <= 3000000) %>%
  filter(`KMs Driven` <= 500000) %>%
  filter(Brand %in% c('Suzuki', 'Toyota', 'Honda'))
# size is a fixed value, so it belongs outside aes();
# Brand is referenced by column name rather than with $
knn.k24 %>%
  ggplot(aes(x = Var1,
             y = Var2)) +
  geom_point(aes(color = factor(prediction)),
             size = 2,
             alpha = 0.3) +
  geom_point(data = cars.clean.sample.new.filtered,
             mapping = aes(x = `KMs Driven`,
                           y = Price,
                           color = factor(Brand,
                                          levels = c('Toyota', 'Suzuki', 'Honda'))),
             size = 1)
I’ll display a few different k-values to provide comparison.
knn.k1 <- grid %>%
  group_by(Var1, Var2) %>%
  mutate(prediction = cars.scaled(Var1, Var2, 1))
knn.k1 %>%
  ggplot(aes(x = Var1,
             y = Var2)) +
  geom_point(aes(color = factor(prediction)),
             size = 2,
             alpha = 0.3) +
  geom_point(data = cars.clean.sample.new.filtered,
             mapping = aes(x = `KMs Driven`,
                           y = Price,
                           color = factor(Brand,
                                          levels = c('Toyota', 'Suzuki', 'Honda'))),
             size = 1)
Try k = 10.
knn.k10 <- grid %>%
  group_by(Var1, Var2) %>%
  mutate(prediction = cars.scaled(Var1, Var2, 10))
knn.k10 %>%
  ggplot(aes(x = Var1,
             y = Var2)) +
  geom_point(aes(color = factor(prediction)),
             size = 2,
             alpha = 0.3) +
  geom_point(data = cars.clean.sample.new.filtered,
             mapping = aes(x = `KMs Driven`,
                           y = Price,
                           color = factor(Brand)),
             size = 1)
Try k = 112, approximately the square root of the 12,500 training rows used earlier.
knn.k112 <- grid %>%
  group_by(Var1, Var2) %>%
  mutate(prediction = cars.scaled(Var1, Var2, 112))
knn.k112 %>%
  ggplot(aes(x = Var1,
             y = Var2)) +
  geom_point(aes(color = factor(prediction)),
             size = 2,
             alpha = 0.3) +
  geom_point(data = cars.clean.sample.new.filtered,
             mapping = aes(x = `KMs Driven`,
                           y = Price,
                           color = factor(Brand,
                                          levels = c('Toyota', 'Suzuki', 'Honda'))),
             size = 1)
Such nice visualizations. Sub-optimal values of k certainly changed the visualizations; for example, choosing k = 1 lets me see more classification zones for different brands, although in this context I don’t think this helps the accuracy of the knn algorithm. Using k = 112, which is approximately the square root of n, I find that Hondas, Toyotas, and Suzukis are the most prominent brands in the data set. I wouldn’t say that any of the boundaries I saw were really that intuitive… I guess it makes sense that a Toyota costs more than a Honda, which costs more than a Suzuki, in general.
First and foremost, one disadvantage is that kilometers driven and price are probably not very good numerical predictors of the brand of a car. They were surprisingly good, and they were the only two numerical predictors that really made much sense in my data set, but I think there are probably better predictors out there; adding the age of the car, for example, might have made my models better.
Second, knn seemed to simplify this problem in order to return a high accuracy in predicting the brand of the car. Over 60% of the cars in the data set were either Toyotas or Suzukis; accordingly, knn made the Toyota and Suzuki prediction regions the largest in the grid visualizations. In fact, with the optimum k value of 24, knn only predicted 3 brands of car: Toyota, Suzuki, and Honda. Although the accuracy of predictions was about 57%, the algorithm neglected the majority of the over 250 car brands in the data set and kind of cheated in order to achieve a high accuracy. This seems to be a major shortcoming of knn; in a data set like this, in which the response has a very large number of classes, the algorithm has trouble predicting the classes that are not highly represented. They get lost.
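One way to make this point concrete is to compare the model's accuracy with the accuracy of always guessing the most common brand, and to look at how often each individual brand is recovered. The sketch below does this on a tiny made-up set of labels; the vectors `actual` and `predicted` are hypothetical and are not results from the car data.

```r
# Hypothetical labels: a few dominant brands and one rare brand
actual    <- c(rep("Toyota", 5), rep("Suzuki", 4), rep("Honda", 2), "Daihatsu")
predicted <- c(rep("Toyota", 5), rep("Suzuki", 3), "Toyota",
               "Honda", "Toyota", "Suzuki")

# Accuracy obtained by always guessing the most common brand
baseline <- max(table(actual)) / length(actual)

# Overall accuracy of the predictions
accuracy <- mean(predicted == actual)

# Confusion matrix with identical row and column levels,
# then per-brand recall (share of each true brand predicted correctly)
lv <- sort(unique(actual))
confusion <- table(actual    = factor(actual, levels = lv),
                   predicted = factor(predicted, levels = lv))
recall <- diag(confusion) / rowSums(confusion)
```

In this toy case the overall accuracy is 0.75 against a baseline of about 0.42, yet the recall for the rare brand is 0, which is the "getting lost" effect described above.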
The advantages of knn are that it was relatively easy to implement and did a surprisingly good job ‘predicting’ the brand of a car based upon two simple numerical indicators. The grid visualizations are quite detailed for small values of k. Most notably, I do think knn told me something about my data that regression could not have in the same way. Using my optimum k-value, I did learn something about Hondas, Suzukis, and Toyotas in Pakistan. Of the used cars in Pakistan, Hondas, Suzukis, and Toyotas have all been driven about the same distance (approx 7500 km), and Toyotas are generally more expensive than Hondas, which are more expensive than Suzukis. The one grid visualization that I created shows this in a concise and attractive way that seems unique to knn. I suppose regression could have helped me learn this… but I was able to perform this analysis quickly with a number of different factors and show them all in a single graphic. I’d say that’s a win for knn.
Overall, I was surprised that knn could predict the brand of the car with a decent degree of accuracy. With more data I bet the algorithm could do a lot better.